Clusters of Countries

Data Used

The data used is Population Growth, Fertility and Mortality Indicators.csv, which records indicators of population growth, fertility, and mortality for each country in the world.

The data contains the following variables:

  • T03 The country code

  • Population.growth.and.indicators.of.fertility.and.mortality The country list

  • X The year column

  • X.1 The indicator name. This column is going to be spread so that each indicator gets its own column.

  • X.2 The values of the observations.

  • X.3 Footnotes

  • X.4 Data source

The Goal

The goal is to group the listed countries into clusters based on the indicators contained in the data.

The Flow

  1. Libraries Importing and Data Preparation.

  2. Exploratory Data Analysis

  3. PCA Transformation.

  4. Biplotting and Interpretation.

Libraries Importing and Data Preparation

Libraries Used

Data Importing

## 'data.frame':    4979 obs. of  7 variables:
##  $ T03                                                        : chr  "Region/Country/Area" "1" "1" "1" ...
##  $ Population.growth.and.indicators.of.fertility.and.mortality: chr  "" "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" ...
##  $ X                                                          : chr  "Year" "2005" "2005" "2005" ...
##  $ X.1                                                        : chr  "Series" "Population annual rate of increase (percent)" "Total fertility rate (children per women)" "Infant mortality for both sexes (per 1,000 live births)" ...
##  $ X.2                                                        : chr  "Value" "1.3" "2.6" "49.1" ...
##  $ X.3                                                        : chr  "Footnotes" "Data refers to a 5-year period preceding the reference year." "Data refers to a 5-year period preceding the reference year." "Data refers to a 5-year period preceding the reference year." ...
##  $ X.4                                                        : chr  "Source" "United Nations Population Division, New York, World Population Prospects: The 2017 Revision, last accessed June 2017." "United Nations Population Division, New York, World Population Prospects: The 2017 Revision; supplemented by da"| __truncated__ "United Nations Statistics Division, New York, \"Demographic Yearbook 2015\" and the demographic statistics data"| __truncated__ ...
  • Only some of the variables are needed to process the data; the first column and the last 2 columns will be eliminated.

  • The year column runs from 2000 to 2016, but most of the countries only have values for 2005, 2010, and 2015.

  • X.1 contains 8 indicators; we’re going to spread them into their own columns.
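
The spreading step can be sketched in base R with `reshape()`. This is only an illustration under assumptions: the post does not show which function was actually used, the toy rows are copied from the 2015 table further down, and the short column names are the ones that appear later in this post.

```r
# Long format: one row per country, year and indicator.
long <- data.frame(
  Code    = c("100", "100", "104", "104"),
  Country = c("Bulgaria", "Bulgaria", "Myanmar", "Myanmar"),
  year    = "2015",
  Series  = rep(c("Total fertility rate (children per women)",
                  "Population annual rate of increase (percent)"), times = 2),
  Value   = c(1.5, -0.6, 2.3, 0.9),
  stringsAsFactors = FALSE
)

# Wide format: one row per country and year, one column per indicator.
wide <- reshape(long,
                idvar     = c("Code", "Country", "year"),
                timevar   = "Series",
                v.names   = "Value",
                direction = "wide")

# reshape() names the new columns "Value.<Series>"; shorten them.
names(wide) <- sub("^Value\\.Total fertility rate.*", "tot.fertil.rate", names(wide))
names(wide) <- sub("^Value\\.Population annual rate.*", "pop.increase", names(wide))
rownames(wide) <- NULL

print(wide)
```

Each indicator ends up in its own numeric column, which is the shape the later steps expect.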

Data cleaning

  • This step removes the last 2 variables and filters the year: we only need the 2015 data to describe the latest condition of each country.
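
A minimal base R sketch of this cleaning step, assuming the raw layout shown in the `str()` output above (the real column names sit in the first data row). The sample rows are taken from this post; footnotes and sources are left empty for brevity, and the object names `raw`, `dat` and `dat2015` are made up for the example.

```r
# A few raw rows as they look right after import.
raw <- data.frame(
  T03 = c("Region/Country/Area", "1", "1", "100", "100"),
  Population.growth.and.indicators.of.fertility.and.mortality =
    c("", "Total, all countries or areas", "Total, all countries or areas",
      "Bulgaria", "Bulgaria"),
  X   = c("Year", "2005", "2005", "2015", "2015"),
  X.1 = c("Series",
          "Population annual rate of increase (percent)",
          "Total fertility rate (children per women)",
          "Population annual rate of increase (percent)",
          "Total fertility rate (children per women)"),
  X.2 = c("Value", "1.3", "2.6", "-0.6", "1.5"),
  X.3 = c("Footnotes", "", "", "", ""),
  X.4 = c("Source", "", "", "", ""),
  stringsAsFactors = FALSE
)

# Drop the embedded header row and the last two columns (footnotes, source).
keep_cols <- c("T03",
               "Population.growth.and.indicators.of.fertility.and.mortality",
               "X", "X.1", "X.2")
dat <- raw[-1, keep_cols]
names(dat) <- c("Code", "Country", "year", "Series", "Value")

# Values were read as text because of the embedded header row.
dat$Value <- as.numeric(dat$Value)

# Keep only the most recent year.
dat2015 <- dat[dat$year == "2015", ]
print(dat2015)
```
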
| Code | Country | year | inf.mort | life.exp.both | life.exp.female | life.exp.male | maternal.mortality.ratio | pop.increase | tot.fertil.rate |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Total, all countries or areas | 2015 | 35.0 | 70.8 | 73.1 | 68.6 | 216 | 1.2 | 2.5 |
| 100 | Bulgaria | 2015 | 8.3 | 74.3 | 77.8 | 70.8 | 11 | -0.6 | 1.5 |
| 104 | Myanmar | 2015 | 45.0 | 66.0 | 68.3 | 63.7 | 178 | 0.9 | 2.3 |
| 108 | Burundi | 2015 | 77.9 | 56.1 | 58.0 | 54.2 | 712 | 3.0 | 6.0 |
| 11 | Western Africa | 2015 | 70.5 | 54.7 | 55.6 | 53.9 | NA | 2.7 | 5.5 |
| 112 | Belarus | 2015 | 3.6 | 72.1 | 77.7 | 66.5 | 4 | 0.0 | 1.6 |

Country = Country list ;
inf.mort = Infant mortality for both sexes (per 1,000 live births) ;
life.exp.both = Life expectancy at birth for both sexes (years) ;
life.exp.male = Life expectancy at birth for males (years) ;
life.exp.female = Life expectancy at birth for females (years) ;
maternal.mortality.ratio = Maternal mortality ratio (deaths per 100,000 population) ;
pop.increase = Population annual rate of increase (percent) ;
tot.fertil.rate = Total fertility rate (children per women)

NA checking

column NA
Code 0
Country 0
year 0
inf.mort 31
life.exp.both 31
life.exp.female 29
life.exp.male 29
maternal.mortality.ratio 73
pop.increase 0
tot.fertil.rate 29
There are many NAs in the data, which means that not every listed country has the data we need.

  • We’re going to replace the NAs with the average value of each variable/indicator.
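
Mean imputation can be sketched like this in base R. The three rows are a toy subset of the 2015 table, so the imputed value here differs from the one computed on the full data.

```r
# Toy data with a gap (NA marks a missing entry).
df <- data.frame(
  Country                  = c("Bulgaria", "Myanmar", "Western Africa"),
  inf.mort                 = c(8.3, 45.0, 70.5),
  maternal.mortality.ratio = c(11, 178, NA),
  stringsAsFactors = FALSE
)

# Replace each NA in a numeric column with that column's mean.
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

print(df)
```
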
| Code | Country | year | inf.mort | life.exp.both | life.exp.female | life.exp.male | maternal.mortality.ratio | pop.increase | tot.fertil.rate |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Total, all countries or areas | 2015 | 35.0 | 70.8 | 73.1 | 68.6 | 216.0000 | 1.2 | 2.5 |
| 100 | Bulgaria | 2015 | 8.3 | 74.3 | 77.8 | 70.8 | 11.0000 | -0.6 | 1.5 |
| 104 | Myanmar | 2015 | 45.0 | 66.0 | 68.3 | 63.7 | 178.0000 | 0.9 | 2.3 |
| 108 | Burundi | 2015 | 77.9 | 56.1 | 58.0 | 54.2 | 712.0000 | 3.0 | 6.0 |
| 11 | Western Africa | 2015 | 70.5 | 54.7 | 55.6 | 53.9 | 162.1842 | 2.7 | 5.5 |
| 112 | Belarus | 2015 | 3.6 | 72.1 | 77.7 | 66.5 | 4.0000 | 0.0 | 1.6 |


Replacing the NAs with column averages has a side effect. Some rows (countries) have no observed values at all, or only 1 or 2, so after the replacement they consist almost entirely of average values. Such rows carry no real information and should be eliminated.

  • Eliminating some rows

I create a vector that indicates whether a row’s values are mostly the average values of each column. If they are, the row is eliminated.
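
One way to build such a vector, as a sketch only: the "more than half" threshold is an assumption, since the post does not state the exact rule, and the data below is invented for the example.

```r
# Toy data: country C has no observations at all, B is missing one value.
df <- data.frame(
  Country = c("A", "B", "C", "D"),
  x = c(1, 2, NA, 6),
  y = c(10, NA, NA, 30),
  z = c(4, 9, NA, 11),
  stringsAsFactors = FALSE
)

vals <- as.matrix(df[, c("x", "y", "z")])
col_means <- colMeans(vals, na.rm = TRUE)

# Mean imputation, column by column.
imputed <- vals
for (j in seq_len(ncol(imputed))) {
  imputed[is.na(imputed[, j]), j] <- col_means[j]
}

# TRUE where a cell equals its column mean, i.e. was most likely imputed.
is_avg <- sweep(imputed, 2, col_means, FUN = "==")

# Hypothetical rule: drop a row when more than half of its values are averages.
mostly_avg <- rowSums(is_avg) > ncol(imputed) / 2
cleaned <- data.frame(Country = df$Country, imputed,
                      stringsAsFactors = FALSE)[!mostly_avg, ]
print(cleaned)
```

A genuinely observed value can coincide with the column mean, so counting NAs per row before imputing (`rowSums(is.na(vals))`) is a safer variant of the same idea.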

## 'data.frame':    235 obs. of  8 variables:
##  $ Country                 : chr  "Total, all countries or areas" "Bulgaria" "Myanmar" "Burundi" ...
##  $ inf.mort                : num  35 8.3 45 77.9 70.5 3.6 29.9 27.7 67.5 4.7 ...
##  $ life.exp.both           : num  70.8 74.3 66 56.1 54.7 72.1 67.6 75.3 56.4 81.8 ...
##  $ life.exp.female         : num  73.1 77.8 68.3 58 55.6 77.7 69.6 76.5 57.7 83.8 ...
##  $ life.exp.male           : num  68.6 70.8 63.7 54.2 53.9 66.5 65.5 74.1 55.1 79.7 ...
##  $ maternal.mortality.ratio: num  216 11 178 712 162 ...
##  $ pop.increase            : num  1.2 -0.6 0.9 3 2.7 0 1.6 2 2.7 1 ...
##  $ tot.fertil.rate         : num  2.5 1.5 2.3 6 5.5 1.6 2.7 3 5 1.6 ...

Continent Column

I think adding a Continent column will give us some more insights, so let’s just do it.
Country Continent
Total, all countries or areas NA
Bulgaria Europe
Myanmar Asia
Burundi Africa
Western Africa NA
Belarus Europe


Some rows cannot be assigned a continent, and none of them is actually a country. They are just regions or certain areas of a continent.

Our observations are countries, so we will just eliminate the rows that represent areas or regions.
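
A sketch of this filtering step, assuming a simple named lookup vector. The vector `continent_of` is hypothetical: the post does not say how the Continent column was actually derived.

```r
# Hypothetical lookup table from country name to continent.
continent_of <- c(Bulgaria = "Europe", Myanmar = "Asia",
                  Burundi = "Africa", Belarus = "Europe")

df <- data.frame(
  Country = c("Total, all countries or areas", "Bulgaria", "Myanmar",
              "Burundi", "Western Africa", "Belarus"),
  stringsAsFactors = FALSE
)

# Names missing from the lookup (regions, aggregates) come back as NA.
df$Continent <- unname(continent_of[df$Country])

# Keep only rows that resolved to a continent, then use Country as row names.
countries <- df[!is.na(df$Continent), ]
rownames(countries) <- countries$Country
countries$Country <- NULL

print(countries)
```
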

Country Continent
Total, all countries or areas ?
Western Africa ?
Central America ?
Eastern Africa ?
Asia ?
Central Asia ?
Western Asia ?
Northern Africa ?
Europe ?
Eastern Europe ?
Northern Europe ?
Western Europe ?
Other non-specified areas ?
Middle Africa ?
Southern Africa ?
Africa ?
Sub-Saharan Africa ?
Northern America ?
Caribbean ?
Eastern Asia ?
Southern Asia ?
South-eastern Asia ?
Southern Europe ?
Latin America & the Caribbean ?
South America ?
Australia and New Zealand ?
Melanesia ?
Micronesia ?
Polynesia ?
South-central Asia ?
Channel Islands ?
Oceania ?
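  • It is better to use Country as row names instead.

|  | inf.mort | life.exp.both | life.exp.female | life.exp.male | maternal.mortality.ratio | pop.increase | tot.fertil.rate | Continent |
|---|---|---|---|---|---|---|---|---|
| Bulgaria | 8.3 | 74.3 | 77.8 | 70.8 | 11 | -0.6 | 1.5 | Europe |
| Myanmar | 45.0 | 66.0 | 68.3 | 63.7 | 178 | 0.9 | 2.3 | Asia |
| Burundi | 77.9 | 56.1 | 58.0 | 54.2 | 712 | 3.0 | 6.0 | Africa |
| Belarus | 3.6 | 72.1 | 77.7 | 66.5 | 4 | 0.0 | 1.6 | Europe |
| Cambodia | 29.9 | 67.6 | 69.6 | 65.5 | 161 | 1.6 | 2.7 | Asia |
| Algeria | 27.7 | 75.3 | 76.5 | 74.1 | 140 | 2.0 | 3.0 | Africa |

Now the data is ready to be processed.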

Exploratory Data Analysis


From the plots in this section we can conclude the following.

Life Expectancy of the World

{% include html_plot/g.html %}


  • African countries dominate the low end of life expectancy, while European countries are mostly at the high end. The rest are spread from the middle to the high end.

  • Countries with high infant mortality usually have lower life expectancy, since deaths in infancy pull the average down. Africa dominates this area and Europe is on the other side.

  • The higher the fertility rate, the lower the life expectancy. Africa dominates this area and Europe is on the other side.

Total Fertility of the World


  • African countries dominate the area where the total fertility rate is high; in that sense they are the most “productive”.

  • It makes sense that countries with a low fertility rate also have low infant mortality.

  • Countries with high total fertility usually have low life expectancy.

  • Countries with a high fertility rate tend to have a high maternal mortality ratio, and this group is still dominated by African countries.

Population Increase of the World


  • Europe has low infant mortality but also low population increase, which seems reasonable.

  • Most African countries and some Asian countries have both high population increase and high infant mortality. This is not a good combination: many children are born, but a large share of them do not survive infancy.

  • Some Asian countries keep their infant mortality low while their population still increases greatly. These are the “oil well” countries of the world.

  • The higher the total fertility rate, the higher the population increase.

Data Clustering

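Data Scaling

The data must be scaled before clustering. K-means relies on distances, so without scaling the variables with the largest ranges (here the maternal mortality ratio) would dominate the result.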

  • Before Scaling
##     inf.mort    life.exp.both   maternal.mortality.ratio  pop.increase   
##  Min.   : 1.6   Min.   :49.40   Min.   :  3.0            Min.   :-2.300  
##  1st Qu.: 6.9   1st Qu.:65.80   1st Qu.: 16.5            1st Qu.: 0.400  
##  Median :17.1   Median :73.00   Median : 76.0            Median : 1.300  
##  Mean   :25.7   Mean   :71.27   Mean   :162.2            Mean   : 1.401  
##  3rd Qu.:42.1   3rd Qu.:76.90   3rd Qu.:187.5            3rd Qu.: 2.300  
##  Max.   :94.4   Max.   :83.40   Max.   :882.0            Max.   : 6.600  
##  tot.fertil.rate    Continent 
##  Min.   :1.200   Africa  :57  
##  1st Qu.:1.800   Americas:42  
##  Median :2.400   Asia    :50  
##  Mean   :2.853   Europe  :40  
##  3rd Qu.:3.750   Oceania :14  
##  Max.   :7.400
  • After Scaling
##     inf.mort       life.exp.both     maternal.mortality.ratio  pop.increase    
##  Min.   :-1.0369   Min.   :-2.6949   Min.   :-0.7804          Min.   :-2.7139  
##  1st Qu.:-0.8089   1st Qu.:-0.6741   1st Qu.:-0.7142          1st Qu.:-0.7343  
##  Median :-0.3700   Median : 0.2131   Median :-0.4225          Median :-0.0744  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000          Mean   : 0.0000  
##  3rd Qu.: 0.7055   3rd Qu.: 0.6937   3rd Qu.: 0.1243          3rd Qu.: 0.6588  
##  Max.   : 2.9556   Max.   : 1.4947   Max.   : 3.5298          Max.   : 3.8116  
##  tot.fertil.rate  
##  Min.   :-1.1726  
##  1st Qu.:-0.7470  
##  Median :-0.3215  
##  Mean   : 0.0000  
##  3rd Qu.: 0.6359  
##  Max.   : 3.2245
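
The scaling itself can be sketched with base R's `scale()`, which centres each column on its mean and divides by its standard deviation. The four rows below are taken from the 2015 table and only serve as an illustration.

```r
# Four countries, two indicators.
x <- data.frame(
  inf.mort      = c(8.3, 45.0, 77.9, 3.6),
  life.exp.both = c(74.3, 66.0, 56.1, 72.1),
  row.names     = c("Bulgaria", "Myanmar", "Burundi", "Belarus")
)

# Standardise every column: (value - column mean) / column sd.
x_scaled <- scale(x)

print(round(colMeans(x_scaled), 10))  # column means after scaling
print(apply(x_scaled, 2, sd))         # column standard deviations after scaling
```

After scaling, every column has mean 0 and standard deviation 1, which matches the "After Scaling" summary above.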

Optimal K value


The elbow method suggests that the optimal K value is 2. But I think we should try 3 as well, since 2 clusters will not give us much information.
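
For reference, the elbow curve can be computed by hand as below. This is a sketch on the built-in `USArrests` data as a stand-in for the scaled indicators; the range of k and `nstart = 25` are arbitrary choices, and the post may well have used a plotting helper instead.

```r
set.seed(42)  # k-means starts from random centres

# Stand-in data, standardised like the indicators above.
x <- scale(USArrests)

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
print(round(wss, 1))

# The "elbow" is the k after which the curve flattens.
if (interactive()) {
  plot(1:6, wss, type = "b",
       xlab = "Number of clusters k",
       ylab = "Total within-cluster sum of squares")
}
```
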

K-means

  • k-means modeling

  • cluster distribution

If we divide the data into two clusters, the composition of each cluster would be :
Cluster Freq
1 153
2 50

But if we divide the data into three clusters, the composition of each cluster would be:
Cluster Freq
1 42
2 104
3 57
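
Fitting both solutions and tabulating the cluster sizes might look like this. Again a sketch on `USArrests` as stand-in data; the column names `clust2` and `clust3` mirror the ones that show up in the `str()` output later in this post.

```r
set.seed(42)  # makes the random starts reproducible

x <- scale(USArrests)  # stand-in for the scaled indicator table

km2 <- kmeans(x, centers = 2, nstart = 25)
km3 <- kmeans(x, centers = 3, nstart = 25)

# Cluster sizes, analogous to the Freq tables above.
print(table(km2$cluster))
print(table(km3$cluster))

# One column of labels per solution.
assignments <- data.frame(clust2 = factor(km2$cluster),
                          clust3 = factor(km3$cluster),
                          row.names = rownames(x))
print(head(assignments))
```

The cluster numbers that k-means returns are arbitrary and can change between runs, so only the grouping itself is meaningful.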


  • assigning cluster to new columns
    Country 2-Clusters 3-Clusters
    Bulgaria 1 2
    Myanmar 1 3
    Burundi 2 1
    Belarus 1 2
    Cambodia 1 3
    Algeria 1 3
    Cameroon 2 1
    Canada 1 2
    Cabo Verde 1 3
    Central African Republic 2 1
    Sri Lanka 1 2
    Chad 2 1
    Chile 1 2
    China 1 2
    Colombia 1 2
    Comoros 2 1
    Mayotte 1 3
    Congo 2 1
    Dem. Rep. of the Congo 2 1
    Costa Rica 1 2
    Croatia 1 2
    Cuba 1 2
    Cyprus 1 2
    Czechia 1 2
    Benin 2 1
    Denmark 1 2
    Dominican Republic 1 3
    Ecuador 1 3
    El Salvador 1 2
    Equatorial Guinea 2 1
    Ethiopia 2 1
    Eritrea 2 1
    Estonia 1 2
    Faroe Islands 1 2
    Angola 2 1
    Fiji 1 2
    Finland 1 2
    France 1 2
    French Guiana 1 3
    French Polynesia 1 2
    Djibouti 2 3
    Gabon 2 3
    Georgia 1 2
    Gambia 2 1
    State of Palestine 1 3
    Germany 1 2
    Antigua and Barbuda 1 2
    Ghana 2 1
    Kiribati 1 3
    Greece 1 2
    Greenland 1 2
    Grenada 1 2
    Azerbaijan 1 3
    Guadeloupe 1 2
    Guam 1 2
    Argentina 1 2
    Guatemala 1 3
    Guinea 2 1
    Guyana 1 3
    Haiti 2 3
    Honduras 1 3
    China, Hong Kong SAR 1 2
    Hungary 1 2
    Iceland 1 2
    India 1 3
    Australia 1 2
    Indonesia 1 3
    Iran (Islamic Republic of) 1 2
    Iraq 1 3
    Ireland 1 2
    Israel 1 2
    Italy 1 2
    Côte d’Ivoire 2 1
    Jamaica 1 2
    Japan 1 2
    Kazakhstan 1 3
    Afghanistan 2 1
    Austria 1 2
    Jordan 1 3
    Kenya 2 1
    Dem. People’s Rep. Korea 1 2
    Republic of Korea 1 2
    Kuwait 1 3
    Kyrgyzstan 1 3
    Lao People’s Dem. Rep.  1 3
    Lebanon 1 3
    Lesotho 2 1
    Latvia 1 2
    Liberia 2 1
    Libya 1 2
    Bahamas 1 2
    Lithuania 1 2
    Luxembourg 1 2
    China, Macao SAR 1 2
    Madagascar 2 1
    Malawi 2 1
    Malaysia 1 2
    Maldives 1 3
    Mali 2 1
    Malta 1 2
    Martinique 1 2
    Mauritania 2 1
    Bahrain 1 2
    Mauritius 1 2
    Mexico 1 2
    Mongolia 1 3
    Republic of Moldova 1 2
    Montenegro 1 2
    Bangladesh 1 3
    Morocco 1 3
    Mozambique 2 1
    Armenia 1 2
    Oman 1 3
    Namibia 2 3
    Barbados 1 2
    Nepal 1 3
    Netherlands 1 2
    Curaçao 1 2
    Aruba 1 2
    New Caledonia 1 2
    Vanuatu 1 3
    New Zealand 1 2
    Nicaragua 1 2
    Belgium 1 2
    Niger 2 1
    Nigeria 2 1
    Norway 1 2
    Micronesia (Fed. States of) 1 3
    Palau 1 2
    Pakistan 2 3
    Panama 1 2
    Papua New Guinea 2 3
    Bermuda 1 2
    Paraguay 1 3
    Peru 1 2
    Philippines 1 3
    Poland 1 2
    Portugal 1 2
    Guinea-Bissau 2 1
    Timor-Leste 2 1
    Puerto Rico 1 2
    Qatar 1 3
    Réunion 1 2
    Bhutan 1 3
    Romania 1 2
    Russian Federation 1 2
    Rwanda 2 3
    Saint Lucia 1 2
    Saint Vincent & Grenadines 1 2
    Sao Tome and Principe 2 3
    Bolivia (Plurin. State of) 1 3
    Saudi Arabia 1 3
    Senegal 2 1
    Serbia 1 2
    Seychelles 1 2
    Sierra Leone 2 1
    Bosnia and Herzegovina 1 2
    Singapore 1 2
    Slovakia 1 2
    Viet Nam 1 2
    Slovenia 1 2
    Somalia 2 1
    South Africa 1 3
    Zimbabwe 2 1
    Botswana 1 3
    Spain 1 2
    South Sudan 2 1
    Sudan 2 1
    Western Sahara 1 3
    Suriname 1 3
    Swaziland 2 1
    Sweden 1 2
    Switzerland 1 2
    Brazil 1 2
    Syrian Arab Republic 1 2
    Tajikistan 1 3
    Thailand 1 2
    Togo 2 1
    Tonga 1 3
    Trinidad and Tobago 1 2
    United Arab Emirates 1 2
    Tunisia 1 2
    Turkey 1 2
    Turkmenistan 1 3
    Albania 1 2
    Uganda 2 1
    Ukraine 1 2
    TFYR of Macedonia 1 2
    Egypt 1 3
    United Kingdom 1 2
    United Rep. of Tanzania 2 1
    Belize 1 3
    United States of America 1 2
    United States Virgin Islands 1 2
    Burkina Faso 2 1
    Uruguay 1 2
    Uzbekistan 1 3
    Venezuela (Boliv. Rep. of) 1 2
    Samoa 1 3
    Yemen 2 1
    Zambia 2 1
    Solomon Islands 1 3
    Brunei Darussalam 1 2

Biplotting

Performing PCA on the Data


| Dimension | eigenvalue | percentage of variance | cumulative percentage of variance |
|---|---|---|---|
| comp 1 | 3.9451216 | 78.902432 | 78.90243 |
| comp 2 | 0.6727678 | 13.455355 | 92.35779 |
| comp 3 | 0.1954976 | 3.909951 | 96.26774 |
| comp 4 | 0.1354516 | 2.709031 | 98.97677 |
| comp 5 | 0.0511615 | 1.023230 | 100.00000 |



Dimension 1 contains about 78.9% of the information (variance) and dimension 2 about 13.5%. Together they cover around 92.4% of the information.
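
The percentages follow directly from the eigenvalues: each one is divided by the sum of all eigenvalues. The first part of the sketch below redoes that arithmetic with the values from the table; the second part shows where such eigenvalues come from, using `prcomp()` on the built-in `USArrests` data as a stand-in (the post's own PCA call is not shown).

```r
# Eigenvalues copied from the table above.
eig <- c(3.9451216, 0.6727678, 0.1954976, 0.1354516, 0.0511615)

pct     <- 100 * eig / sum(eig)  # share of variance per component
cum_pct <- cumsum(pct)           # running total

print(round(pct, 2))
print(round(cum_pct, 2))

# The same quantities from a fitted PCA on stand-in data.
pca <- prcomp(USArrests, scale. = TRUE)
eig_demo <- pca$sdev^2           # eigenvalues of the correlation matrix
print(round(100 * eig_demo / sum(eig_demo), 2))
```

With standardised variables the eigenvalues sum to the number of variables, which is why the five eigenvalues in the table add up to 5.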

Variables Contribution


Cluster Plot

2 Clusters


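Note that k-means labels are arbitrary: in the assignment table above, the group described here as cluster 1 carries label 2, and vice versa.

When we divide the data into 2 clusters, we can conclude that cluster 1 is: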

  • the countries which have low life expectancy for male and female

  • the countries which have high fertility rate

  • the countries which have high population increase

  • and African countries dominate this cluster.

This cluster suggests that the countries in it are probably less healthy, since they have low life expectancy. These countries will have more young people in the future, since the fertility rate is high and the population grows rapidly.

Cluster 2 is:

  • the countries which have high life expectancy for male and female

  • the countries which have low fertility rate

  • the countries which have low infant mortality number

  • the countries which have low maternal mortality ratio

  • and all European countries are in this cluster.

This cluster suggests that the countries in it will tend to have fewer productive (working-age) people in the future, since the fertility rate is low and the population is barely growing. Combined with high life expectancy, this means the population of these countries will one day be dominated by old people.

3 Clusters


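Note that k-means labels are arbitrary: in the assignment table above, the middle group described below as cluster 2 carries label 3, and the high-life-expectancy group described as cluster 3 carries label 2.

When we divide the data into 3 clusters, we can conclude that cluster 1 is: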

  • the countries which have low life expectancy for male and female

  • the countries which have high fertility rate

  • the countries which have high population increase

  • African countries still dominate this cluster

This cluster is not really different from cluster 1 in the 2-cluster case.

Cluster 2 is:

  • the countries in the middle, whose observation values are near the average.

  • there are some outliers in this cluster: countries with high population growth and low infant mortality, the “oil well” countries mentioned earlier.

Cluster 3 is:

  • the countries which have high life expectancy for male and female

  • the countries which have low fertility rate

  • the countries which have low population increase

  • the countries which have low maternal mortality ratio

This cluster suggests that the countries in it are more likely to have fewer young people than countries in other clusters, since they have low population increase and a low fertility rate. These countries should be more “productive”, that is, have a higher fertility rate.

Animated Plot

Next we’re going to look at an animated plot of each country in each cluster from 2005 to 2015. We expect to see some countries change their cluster over time.



##                  Country                   clust2                   clust3 
##                        0                        0                        0 
##                     year                 inf.mort            life.exp.both 
##                        0                       17                       16 
## maternal.mortality.ratio             pop.increase          tot.fertil.rate 
##                       74                        5                        5 
##                Continent 
##                        0
##                  Country                   clust2                   clust3 
##                        0                        0                        0 
##                     year                 inf.mort            life.exp.both 
##                        0                        0                        0 
## maternal.mortality.ratio             pop.increase          tot.fertil.rate 
##                        0                        0                        0 
##                Continent 
##                        0
## 'data.frame':    541 obs. of  10 variables:
##  $ Country                 : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
##  $ clust2                  : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 1 1 2 ...
##  $ clust3                  : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
##  $ year                    : chr  "2005" "2010" "2015" "2005" ...
##  $ inf.mort                : num  89.5 76.7 68.6 21.1 16.8 ...
##  $ life.exp.both           : num  56.9 60 62.3 74.8 75.6 77.7 71.5 73.9 75.3 50 ...
##  $ maternal.mortality.ratio: num  821 584 396 30 30 29 148 147 140 705 ...
##  $ pop.increase            : num  4.4 2.8 3.2 -0.3 -0.9 -0.1 1.3 1.6 2 3.5 ...
##  $ tot.fertil.rate         : num  7.2 6.4 5.3 1.9 1.6 1.7 2.4 2.7 3 6.6 ...
##  $ Continent               : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 4 4 4 1 1 1 1 ...


Plotting

##     inf.mort       life.exp.both     maternal.mortality.ratio
##  Min.   :-1.0953   Min.   :-2.7946   Min.   :-0.7531         
##  1st Qu.:-0.8480   1st Qu.:-0.6549   1st Qu.:-0.7024         
##  Median :-0.3681   Median : 0.2701   Median :-0.5165         
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000         
##  3rd Qu.: 0.7282   3rd Qu.: 0.7158   3rd Qu.: 0.4170         
##  Max.   : 2.8433   Max.   : 1.5739   Max.   : 3.2301         
##   pop.increase      tot.fertil.rate  
##  Min.   :-2.50753   Min.   :-1.2305  
##  1st Qu.:-0.66838   1st Qu.:-0.7849  
##  Median :-0.07722   Median :-0.3393  
##  Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.57962   3rd Qu.: 0.6791  
##  Max.   : 8.46171   Max.   : 2.9707
##     inf.mort       life.exp.both     maternal.mortality.ratio  pop.increase    
##  Min.   :-1.0369   Min.   :-2.6949   Min.   :-0.7804          Min.   :-2.7139  
##  1st Qu.:-0.8089   1st Qu.:-0.6741   1st Qu.:-0.7142          1st Qu.:-0.7343  
##  Median :-0.3700   Median : 0.2131   Median :-0.4225          Median :-0.0744  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000          Mean   : 0.0000  
##  3rd Qu.: 0.7055   3rd Qu.: 0.6937   3rd Qu.: 0.1243          3rd Qu.: 0.6588  
##  Max.   : 2.9556   Max.   : 1.4947   Max.   : 3.5298          Max.   : 3.8116  
##  tot.fertil.rate  
##  Min.   :-1.1726  
##  1st Qu.:-0.7470  
##  Median :-0.3215  
##  Mean   : 0.0000  
##  3rd Qu.: 0.6359  
##  Max.   : 3.2245
## 'data.frame':    541 obs. of  7 variables:
##  $ PC1      : num  -4.8 -3.38 -2.48 1.48 1.8 ...
##  $ PC2      : num  -0.674 -0.027 -0.489 0.799 1.122 ...
##  $ Country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
##  $ clust2   : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 1 1 2 ...
##  $ clust3   : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
##  $ year     : num  2005 2010 2015 2005 2010 ...
##  $ Continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 4 4 4 1 1 1 1 ...
  • 2 Clusters


Some countries move from cluster 1 to cluster 2 over time.


Some countries change their cluster over time.

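The cluster positions are flipped compared with the static cluster plots, because the variables (“var”) plot of this PCA is oriented differently.


Every arrow is rotated by 180 degrees. Since the sign of a principal component is arbitrary, the information gained from the animated plot is still valid.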
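
That a 180-degree flip does not change anything essential can be checked numerically: mirroring both axes leaves all pairwise distances between observations unchanged. A small sketch on the built-in `USArrests` data as a stand-in:

```r
pca <- prcomp(USArrests, scale. = TRUE)  # stand-in data
scores  <- pca$x[, 1:2]                  # coordinates on PC1 and PC2
flipped <- -scores                       # both axes mirrored (180-degree rotation)

# Pairwise distances between observations before and after the flip.
same <- all.equal(as.vector(dist(scores)), as.vector(dist(flipped)))
print(same)
```

Because the distances are identical, neighbours stay neighbours and the cluster structure reads the same in both orientations.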


Recommendation

Based on the previous analysis, I recommend using 3 clusters because it gives us more information. The 2-cluster solution is too general, while the 3-cluster solution is more specific.

With 2 clusters we only learn that there are 2 groups of countries: one with high life expectancy, low fertility rate, and low population increase, and the other with the opposite.

With 3 clusters we can also see the middle cluster between the extremes.